The data come from Kaggle, the largest online platform for machine learning practitioners, hosting datasets and competitions around data science.
The Ames Housing dataset was compiled by Dean De Cock for use in data science education. As the competition description puts it: "It's an incredible alternative for data scientists looking for a modernized and expanded version of the often cited Boston Housing dataset. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges Kaggle users to predict the final price of each home." (Source: Kaggle)
First, I load the data. The first file will be used for training, whereas the holdout file will only be used to submit out-of-sample predictions.
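A minimal sketch of this loading step, assuming the standard Kaggle file names "train.csv" and "test.csv" in the working directory (the file names are my assumption, they are not stated in the post):

```r
library(tidyverse)

# Training data including the target SalePrice
data <- read_csv("train.csv")

# Holdout data without SalePrice, used only for the submission
holdout <- read_csv("test.csv")
```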
Looking at the data, it quickly becomes apparent that there is a huge number of predictors, namely 79, that can be used to predict the sale price, some numeric and some character.
names(data)
## [1] "Id" "MSSubClass" "MSZoning" "LotFrontage"
## [5] "LotArea" "Street" "Alley" "LotShape"
## [9] "LandContour" "Utilities" "LotConfig" "LandSlope"
## [13] "Neighborhood" "Condition1" "Condition2" "BldgType"
## [17] "HouseStyle" "OverallQual" "OverallCond" "YearBuilt"
## [21] "YearRemodAdd" "RoofStyle" "RoofMatl" "Exterior1st"
## [25] "Exterior2nd" "MasVnrType" "MasVnrArea" "ExterQual"
## [29] "ExterCond" "Foundation" "BsmtQual" "BsmtCond"
## [33] "BsmtExposure" "BsmtFinType1" "BsmtFinSF1" "BsmtFinType2"
## [37] "BsmtFinSF2" "BsmtUnfSF" "TotalBsmtSF" "Heating"
## [41] "HeatingQC" "CentralAir" "Electrical" "1stFlrSF"
## [45] "2ndFlrSF" "LowQualFinSF" "GrLivArea" "BsmtFullBath"
## [49] "BsmtHalfBath" "FullBath" "HalfBath" "BedroomAbvGr"
## [53] "KitchenAbvGr" "KitchenQual" "TotRmsAbvGrd" "Functional"
## [57] "Fireplaces" "FireplaceQu" "GarageType" "GarageYrBlt"
## [61] "GarageFinish" "GarageCars" "GarageArea" "GarageQual"
## [65] "GarageCond" "PavedDrive" "WoodDeckSF" "OpenPorchSF"
## [69] "EnclosedPorch" "3SsnPorch" "ScreenPorch" "PoolArea"
## [73] "PoolQC" "Fence" "MiscFeature" "MiscVal"
## [77] "MoSold" "YrSold" "SaleType" "SaleCondition"
## [81] "SalePrice"
Looking at the missingness in predictors:
colMeans(is.na(data)) %>% 
  tidy() %>% 
  rename(pct = x) %>% 
  mutate(names = fct_reorder(names, pct)) %>% 
  filter(pct > 0) %>% 
  ggplot(aes(pct, names)) +
  geom_col(fill = "midnightblue") +
  labs(title = "Missing Data In Variables",
       subtitle = "Percent missingness calculated for each column",
       y = NULL,
       x = NULL) +
  scale_x_continuous(labels = scales::percent_format(),
                     limits = c(0, 1)) +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))
Some variables have manageable missingness of less than 20%, whereas others have too many missing values (>20%). The latter will be discarded. The rest will be imputed using either the median or the mode, depending on whether the variable is numeric or categorical.
data <- data %>% 
  select(-c(Alley, FireplaceQu, PoolQC, Fence, MiscFeature, Id))
For now, it looks like all other predictors are tidy. Feature engineering steps will be included in the recipe when building the model.
Given the huge number of predictors, including everything would likely lead to multicollinearity and overfitting. Therefore, let's take a look at a correlation matrix (using the GGally package) to gauge the relation of the numeric predictors to the target variable.
data %>% 
  select_if(is.numeric) %>% 
  drop_na() %>% 
  ggcorr(label = TRUE, label_size = 2)
Proceeding to look at variance inflation factors:
data %>% 
  select_if(is.numeric) %>% 
  lm(formula = SalePrice ~ .) %>% 
  vif()
## Error in vif.default(.): there are aliased coefficients in the model
Straight away, the error tells me that there are perfectly linearly correlated predictors in the dataset. Let’s see which ones and remove them:
data %>% 
  select_if(is.numeric) %>% 
  lm(formula = SalePrice ~ .) %>% 
  alias()
## Model :
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual +
## OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 +
## BsmtFinSF2 + BsmtUnfSF + TotalBsmtSF + `1stFlrSF` + `2ndFlrSF` +
## LowQualFinSF + GrLivArea + BsmtFullBath + BsmtHalfBath +
## FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd +
## Fireplaces + GarageYrBlt + GarageCars + GarageArea + WoodDeckSF +
## OpenPorchSF + EnclosedPorch + `3SsnPorch` + ScreenPorch +
## PoolArea + MiscVal + MoSold + YrSold
##
## Complete :
## (Intercept) MSSubClass LotFrontage LotArea OverallQual OverallCond
## TotalBsmtSF 0 0 0 0 0 0
## GrLivArea 0 0 0 0 0 0
## YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF
## TotalBsmtSF 0 0 0 1 1 1
## GrLivArea 0 0 0 0 0 0
## `1stFlrSF` `2ndFlrSF` LowQualFinSF BsmtFullBath BsmtHalfBath
## TotalBsmtSF 0 0 0 0 0
## GrLivArea 1 1 1 0 0
## FullBath HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd Fireplaces
## TotalBsmtSF 0 0 0 0 0 0
## GrLivArea 0 0 0 0 0 0
## GarageYrBlt GarageCars GarageArea WoodDeckSF OpenPorchSF
## TotalBsmtSF 0 0 0 0 0
## GrLivArea 0 0 0 0 0
## EnclosedPorch `3SsnPorch` ScreenPorch PoolArea MiscVal MoSold
## TotalBsmtSF 0 0 0 0 0 0
## GrLivArea 0 0 0 0 0 0
## YrSold
## TotalBsmtSF 0
## GrLivArea 0
data <- data %>% 
  select(-c(TotalBsmtSF, GrLivArea))
Running the variance inflation factors again:
data %>% 
  select_if(is.numeric) %>% 
  lm(formula = SalePrice ~ .) %>% 
  vif() %>% 
  tidy() %>% 
  rename(vif = x) %>% 
  mutate(names = fct_reorder(as.factor(names), vif)) %>% 
  ggplot(aes(vif, names)) +
  geom_col(fill = "midnightblue") +
  labs(title = "VIF For Numeric Predictors",
       subtitle = NULL,
       y = NULL,
       x = NULL) +
  theme_light() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))
The combination of the correlation matrix and the VIFs reveals moderate to considerable multicollinearity in the model. However, no predictor exceeds the commonly used threshold of 10, so removing the two perfectly collinear numeric predictors in the previous step was sufficient.
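For reference (a standard definition, not from the original post), the variance inflation factor of predictor \(j\) is based on the \(R_j^2\) obtained from regressing predictor \(j\) on all remaining predictors:

\[
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
\]

The threshold of 10 therefore corresponds to \(R_j^2 = 0.9\), i.e. 90% of a predictor's variance being explained by the other predictors.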
Looking at the usefulness of the predictors, the correlation plot shows a handful of numeric predictors with virtually no correlation to the target variable. Removing these will reduce the complexity of the model and potentially reduce overfitting down the line.
data <- data %>% 
  select(-c(
    data %>% 
      select_if(is.numeric) %>% 
      drop_na() %>% 
      cor() %>% 
      as_tibble() %>% 
      mutate(variable = names(.)) %>% 
      select(variable, SalePrice) %>% 
      filter(abs(SalePrice) < 0.1) %>% 
      pull(variable)
  ))
With the remaining variables, we can now dive into exploratory data analysis (EDA) and make some graphs to better understand the data. This will be done in the order of the variables in the dataset; many will be left out to keep this post as concise as possible.
Starting with the most relevant: The Target.
data %>% 
  ggplot(aes(SalePrice)) +
  geom_histogram(fill = "midnightblue", colour = "white") +
  labs(title = "Target Variable Distribution",
       subtitle = "Variable Name: SalePrice",
       y = "Frequency") +
  scale_x_continuous(labels = scales::dollar_format()) +
  theme_light() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))
As the target variable is not normally distributed and resembles a log-normal distribution, a recipe step to log-transform the target will be included at a later stage. This is also best practice in applications with prices and other non-negative values.
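As a quick sanity check (a sketch, not part of the original pipeline), the log-transformed target should look much closer to a normal distribution:

```r
# Histogram of the log-transformed target; the actual transformation
# happens later inside the recipe
data %>% 
  ggplot(aes(log(SalePrice))) +
  geom_histogram(fill = "midnightblue", colour = "white") +
  labs(title = "Log-Transformed Target",
       y = "Frequency") +
  theme_light()
```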
data %>% 
  count(MSZoning, sort = TRUE)
Virtually all houses in the data set are residential, mostly differentiated by low versus high density areas.
data %>% 
  ggplot(aes(LotArea / 10.764)) +
  geom_histogram(fill = "midnightblue", colour = "white") +
  labs(title = "Lot Area Distribution",
       subtitle = "Converted to sqm",
       y = "Frequency",
       x = "LotArea (sqm)") +
  scale_x_log10(labels = scales::comma_format()) +
  theme_light() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))
The neighbourhood strongly affects the price distribution:
data %>% 
  ggplot(aes(y = fct_reorder(as.factor(Neighborhood), -SalePrice),
             x = SalePrice)) +
  geom_boxplot(outlier.alpha = 0.4) +
  labs(title = "Neighbourhoods and Sales Prices",
       y = "Neighbourhood",
       x = "Sales Price") +
  scale_x_log10(labels = scales::dollar_format()) +
  theme_light() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))
The effect of overall quality is even stronger:
data %>% 
  ggplot(aes(y = fct_reorder(as.factor(OverallQual), -SalePrice),
             x = SalePrice)) +
  geom_boxplot(outlier.alpha = 0.4) +
  labs(title = "Overall Quality and Sales Prices",
       x = "Sales Price",
       y = "Overall Quality") +
  scale_x_continuous(labels = scales::dollar_format()) +
  theme_light() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))
Due to the sheer number of predictors, variable importance will be evaluated after the model is trained. However, it has already become clear that the overall quality of the house is a very strong predictor.
Lastly, all character variables will be converted to factors:
data <- data %>% 
  mutate(across(where(is.character), as.factor))
First, the data is split into training and testing sets. Also, five-fold cross validation is employed for reliable calculation of performance metrics.
dt_split <- initial_split(data, strata = "SalePrice")
dt_train <- training(dt_split)
dt_test <- testing(dt_split)
folds <- vfold_cv(dt_train, v = 5)
Creating the recipe with steps for normalisation of numeric predictors as well as the imputation of missing values:
gb_rec <-
recipe(SalePrice ~ .,
data = dt_train) %>%
step_mutate(OverallGrade = OverallQual * OverallCond) %>%
step_num2factor(OverallQual, levels = as.character(1:10)) %>%
step_num2factor(OverallCond, levels = as.character(1:10)) %>%
step_impute_median(all_numeric_predictors()) %>%
step_impute_mode(all_nominal_predictors()) %>%
step_novel(all_nominal()) %>%
step_log(SalePrice, skip = TRUE) %>%
step_normalize(all_numeric_predictors()) %>%
step_dummy(all_nominal(), one_hot = TRUE) %>%
step_zv(all_numeric())
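A sketch for sanity-checking the recipe before tuning (not part of the original post): prepping it on the training data and baking it back reveals the processed design matrix, which helps verify that imputation and dummy coding work as intended.

```r
# Estimate all recipe steps on the training data and apply them;
# bake(new_data = NULL) returns the processed training set
gb_rec %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  glimpse()
```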
Setting up the model specifications with tuning options for hyperparameters:
gb_spec <-
boost_tree(
trees = 1000,
tree_depth = tune(),
min_n = tune(),
loss_reduction = tune(),
sample_size = tune(),
mtry = tune(),
learn_rate = tune()
) %>%
set_engine("xgboost", importance = "impurity") %>%
set_mode("regression")
Engine-specific options can be passed via set_engine(). Note that the xgboost engine reports gain-based variable importance by default; the importance argument shown here follows the convention of the ranger engine. Proceeding with setting up the workflow:
gb_wflow <- 
  workflow() %>% 
  add_recipe(gb_rec) %>% 
  add_model(gb_spec)
Setting up a space-filling design for time-efficient hyperparameter tuning:
gb_grid <- 
  grid_latin_hypercube(
    tree_depth(),
    min_n(),
    loss_reduction(),
    sample_size = sample_prop(),
    finalize(mtry(), dt_train),
    learn_rate(),
    size = 50
  )
Now, the hyperparameters can be tuned with parallel computing in order to utilise the available cores.
# Helper to clean up the foreach parallel backend after tuning
unregister_dopar <- function() {
  env <- foreach:::.foreachGlobals
  rm(list = ls(name = env), pos = env)
}

# Tune across the grid on 6 parallel workers
cl <- makePSOCKcluster(6)
registerDoParallel(cl)

gb_tune <- tune_grid(object = gb_wflow,
                     resamples = folds,
                     grid = gb_grid)

stopCluster(cl)
unregister_dopar()
We can now look at the training results of the previous step and pick the best configuration of hyperparameters.
# Look at tuning results
gb_tune %>% 
  show_best(metric = "rsq") %>% 
  select(-c(.estimator, .config))

# Visualise tuning results
gb_tune %>% 
  collect_metrics() %>% 
  filter(.metric == "rsq") %>% 
  select(mean, mtry:sample_size) %>% 
  pivot_longer(mtry:sample_size,
               values_to = "value",
               names_to = "parameter") %>% 
  ggplot(aes(value, mean, color = parameter)) +
  geom_point(alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~parameter, scales = "free_x") +
  labs(x = NULL, y = "rsq") +
  theme_bw()
The best model out of training is selected and used for the final fit.
gb_final_wflow <- gb_wflow %>% 
  finalize_workflow(select_best(gb_tune, metric = "rmse"))

gb_final_fit <- gb_final_wflow %>% 
  last_fit(dt_split)

gb_final_fit %>% collect_metrics()
We can see that the \(R^2\) for the final out-of-sample fit is fairly high, indicating high predictive ability of the employed regressors.
gb_final_fit %>% 
  pluck(".workflow", 1) %>% 
  extract_fit_parsnip() %>% 
  vi() %>% 
  slice_max(order_by = Importance, n = 20) %>% 
  ggplot(aes(Importance, reorder(Variable, Importance))) +
  geom_col(fill = "midnightblue", colour = "white") +
  labs(title = "Variable Importance",
       subtitle = "Only the most important predictors are shown.",
       y = "Predictor",
       x = "Relative Variable Importance") +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        plot.subtitle = element_text(face = "italic", colour = "grey50"))
Looking at the plot above, it becomes apparent which variables matter most for predicting the sale price of homes in Ames.
What is surprising is that the original construction year is more important than the overall quality, despite showing a lower correlation with SalePrice in the EDA. This might hint at non-linearities and interactions that the gradient boosting model picked up but a linear correlation could not register.
gb_final_fit %>% 
  collect_predictions() %>% 
  ggplot(aes(SalePrice, exp(.pred))) +
  geom_point(alpha = 0.6, colour = "midnightblue") +
  geom_abline(lty = "dashed", colour = "grey50") +
  labs(x = "True Sales Price",
       y = "Predicted Sales Price",
       title = "Predictions of Final Fit") +
  scale_y_continuous(labels = scales::dollar_format(),
                     breaks = seq(0, 800000, 200000),
                     limits = c(0, 800000)) +
  scale_x_continuous(labels = scales::dollar_format(),
                     breaks = seq(0, 800000, 200000),
                     limits = c(0, 800000)) +
  coord_equal() +
  theme_bw() +
  theme(plot.title = element_text(face = "bold", size = 12),
        axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0.5))
With the trained model, I can now make predictions for the holdout dataset. Pulling out the workflow:
gb_trained_wflow <- extract_workflow(gb_final_fit)
And making predictions:
gb_trained_wflow %>% 
  augment(holdout) %>% 
  transmute(Id, SalePrice = exp(.pred))
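To actually submit, the predictions can be written to disk in the two-column format Kaggle expects. A sketch (the file name "submission.csv" is my assumption for illustration):

```r
# Predict on the holdout, back-transform from the log scale,
# and write the (Id, SalePrice) pairs to a CSV for submission
gb_trained_wflow %>% 
  augment(holdout) %>% 
  transmute(Id, SalePrice = exp(.pred)) %>% 
  write_csv("submission.csv")
```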
This submission made it into the top 20% of the Kaggle leaderboard, which goes to show how strong boosting models are in settings like this, notably with very light feature engineering.
I hope this post has been interesting to you. In case of feedback, feel free to reach out.
Thank you for reading!
A work by Mathias Steilen